4 research outputs found

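    Regularized dependency modelling between gene expression and metabolomics data in the study of metabolic regulation (original Finnish title: Regularisoitu riippuvuuksien mallintaminen geeniekpressio- ja metabolomiikkadatan välillä metabolian säätelyn tutkimuksessa)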

    Fusing different high-throughput data sources is an effective way to reveal the functions of unknown genes, as well as regulatory relationships between biological components such as genes and metabolites. Dependencies between biological components functioning in different layers of biological regulation can be investigated using canonical correlation analysis (CCA). However, the properties of high-throughput bioinformatics data pose many challenges for data analysis: the sample size is often insufficient compared to the dimensionality of the data, and the data exhibit multicollinearity due to, for example, co-expressed and co-regulated genes. Therefore, a regularized version of classical CCA is often adopted. An alternative way of introducing regularization to statistical models is to perform Bayesian data analysis with suitable priors. In this thesis, the performance of a new variant of Bayesian CCA, group-wise sparse CCA (gsCCA), is compared to that of a classical ridge regression regularized CCA (rrCCA) in revealing relevant information shared between two high-throughput data sets. The gsCCA produces a partly similar regularization effect to that of the rrCCA but, in addition, introduces a new type of regularization to the data covariance matrices. Both CCA methods are applied to gene expression and metabolite concentration measurements obtained from the oxidative-stress-tolerant Arabidopsis thaliana ecotype Col-0 and the oxidative-stress-sensitive mutant rcd1, as time series under ozone exposure and in a control condition. The aim of this work is to reveal new regulatory mechanisms in oxidative stress signalling in plants. For both methods, rrCCA and gsCCA, the thesis illustrates their potential to reveal both already known and possibly new regulatory mechanisms between genes and metabolites in Arabidopsis thaliana oxidative stress signalling.
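
To make the idea of ridge-regularized CCA concrete, the following is a minimal sketch, not the thesis implementation. It assumes the common formulation in which a ridge term is added to the diagonal of each within-set covariance matrix before whitening, which keeps the problem solvable when there are more features than samples. The function name `ridge_cca` and the parameters `lam_x` and `lam_y` are hypothetical names chosen for illustration.

```python
import numpy as np


def ridge_cca(X, Y, lam_x=0.1, lam_y=0.1, n_components=1):
    """Ridge-regularized CCA sketch (assumed formulation, illustrative only).

    X: (n_samples, p) array, Y: (n_samples, q) array.
    Returns projection weights A (p, k), B (q, k) and the k leading
    regularized canonical correlations.
    """
    # Center both data sets column-wise.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Within-set covariances with a ridge term on the diagonal,
    # which makes them invertible even when p or q exceeds n.
    Cxx = X.T @ X / (n - 1) + lam_x * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + lam_y * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root via a symmetric eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)

    # Singular values of the whitened cross-covariance are the
    # regularized canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    A = Kx @ U[:, :n_components]
    B = Ky @ Vt.T[:, :n_components]
    return A, B, s[:n_components]


# Synthetic example: two high-dimensional views sharing one latent signal,
# with fewer samples than features (not real expression or metabolite data).
rng = np.random.default_rng(0)
n = 20
z = rng.normal(size=(n, 1))
X_demo = z @ rng.normal(size=(1, 50)) + 0.1 * rng.normal(size=(n, 50))
Y_demo = z @ rng.normal(size=(1, 30)) + 0.1 * rng.normal(size=(n, 30))
A_demo, B_demo, corr_demo = ridge_cca(X_demo, Y_demo)
```

In practice the ridge parameters would be chosen by cross-validation; the fixed values here are placeholders.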

    Enhancer prediction in the human genome by probabilistic modelling of the chromatin feature patterns

    BACKGROUND: The binding sites of transcription factors (TFs) and the localisation of histone modifications in the human genome can be quantified by the chromatin immunoprecipitation assay coupled with next-generation sequencing (ChIP-seq). The resulting chromatin feature data have been successfully adopted for genome-wide enhancer identification by several unsupervised and supervised machine learning methods. However, the current methods predict different numbers and different sets of enhancers for the same cell type and do not utilise the pattern of the ChIP-seq coverage profiles efficiently. RESULTS: In this work, we propose a PRobabilistic Enhancer PRedictIoN Tool (PREPRINT) that assumes characteristic coverage patterns of chromatin features at enhancers and employs a statistical model to account for their variability. PREPRINT defines probabilistic distance measures to quantify the similarity of the genomic query regions and the characteristic coverage patterns. The probabilistic scores of the enhancer and non-enhancer samples are utilised to train a kernel-based classifier. The performance of the method is demonstrated on ENCODE data for two cell lines. The predicted enhancers are computationally validated based on the transcriptional regulatory protein binding sites and compared to the predictions obtained by state-of-the-art methods. CONCLUSION: PREPRINT compares favourably to the state-of-the-art methods, especially when the methods are required to predict a larger set of enhancers. PREPRINT generalises successfully to data from a cell type not utilised for training, and it often performs better than the previous methods. The PREPRINT enhancers are less sensitive to the choice of prediction threshold. PREPRINT identifies biologically validated enhancers not predicted by the competing methods. The enhancers predicted by PREPRINT can aid genome interpretation in functional genomics and clinical studies. Peer reviewed.

    ChromDMM: a Dirichlet-multinomial mixture model for clustering heterogeneous epigenetic data

    Publisher Copyright: © 2022 The Author(s). Published by Oxford University Press. Motivation: Research on epigenetic modifications and other chromatin features at genomic regulatory elements elucidates essential biological mechanisms, including the regulation of gene expression. Despite the growing number of epigenetic datasets, new tools are still needed to discover novel distinctive patterns of heterogeneous epigenetic signals at regulatory elements. Results: We introduce ChromDMM, a product Dirichlet-multinomial mixture model for clustering genomic regions that are characterized by multiple chromatin features. ChromDMM extends the mixture model framework with profile shifting and flipping, which can probabilistically account for inaccuracies in the position and strand orientation of the genomic regions. Owing to hyper-parameter optimization, ChromDMM can also regularize the smoothness of the epigenetic profiles across consecutive genomic regions. With simulated data, we demonstrate that ChromDMM clusters, shifts and strand-orients the profiles more accurately than previous methods. With ENCODE data, we show that the clustering of enhancer regions in the human genome reveals distinct patterns in several chromatin features. We further validate the enhancer clusters by their enrichment for transcriptional regulatory factor binding sites. Peer reviewed.
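
As a rough illustration of the building block behind a product Dirichlet-multinomial model, the sketch below evaluates the standard Dirichlet-multinomial log-probability of one count profile and sums it over several chromatin features, assuming the features are independent given the cluster. It does not reproduce ChromDMM's shifting, flipping, mixture weights or hyper-parameter optimization; the function names are hypothetical.

```python
from math import lgamma


def dm_log_likelihood(counts, alpha):
    """Log-probability of one count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts per bin.
    alpha: positive Dirichlet parameters, one per bin.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: log n! - sum_k log x_k!
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of gamma functions over the totals.
    log_norm = lgamma(a) - lgamma(n + a)
    # Per-bin gamma ratios.
    log_bins = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_bins


def product_dm_log_likelihood(profiles, alphas):
    """Sum of per-feature log-likelihoods, i.e. the log of their product.

    profiles: one count vector per chromatin feature for a single region.
    alphas: the matching Dirichlet parameter vector for each feature.
    """
    return sum(dm_log_likelihood(x, a) for x, a in zip(profiles, alphas))
```

In a mixture model, a quantity of this kind would be computed once per cluster and combined with the cluster weights to obtain the posterior cluster probabilities of a region.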

    Bayesian metabolic flux analysis reveals intracellular flux couplings

    Motivation: Metabolic flux balance analysis (FBA) is a standard tool for analyzing metabolic reaction rates compatible with measurements, steady-state assumptions and the metabolic reaction network stoichiometry. Flux analysis methods commonly place model assumptions on fluxes because of the convenience of formulating the problem as a linear programming model, while many methods do not consider the inherent uncertainty in flux estimates. Results: We introduce a novel paradigm of Bayesian metabolic flux analysis that models the reactions of the whole genome-scale cellular system in probabilistic terms, and can infer the full flux vector distribution of genome-scale metabolic systems based on exchange and intracellular (e.g. 13C) flux measurements, steady-state assumptions, and objective function assumptions. The Bayesian model couples all fluxes jointly in a simple truncated multivariate posterior distribution, which reveals informative flux couplings. Our model is a plug-in replacement for conventional metabolic balance methods, such as FBA. Our experiments indicate that we can characterize the genome-scale flux covariances, reveal flux couplings, and determine more intracellular unobserved fluxes in Clostridium acetobutylicum from 13C data than flux variability analysis. Peer reviewed.
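
For context, conventional FBA, the baseline the abstract contrasts with, can be written as a small linear program: maximize an objective flux subject to the steady-state constraint S v = 0 and flux bounds. The sketch below uses a made-up three-reaction toy network, not the Clostridium acetobutylicum model, and solves it with `scipy.optimize.linprog`. It returns a single point estimate and therefore says nothing about flux uncertainty or couplings, which is the gap the Bayesian approach addresses.

```python
import numpy as np
from scipy.optimize import linprog


def flux_balance(S, bounds, objective):
    """Conventional FBA as a linear program (toy sketch).

    S: (metabolites, reactions) stoichiometric matrix.
    bounds: list of (lower, upper) flux bounds, one per reaction.
    objective: weights of the fluxes to maximize.
    """
    # linprog minimizes, so negate the objective to maximize it.
    c = -np.asarray(objective, dtype=float)
    # Steady state: net production of every internal metabolite is zero.
    b_eq = np.zeros(S.shape[0])
    result = linprog(c, A_eq=S, b_eq=b_eq, bounds=bounds)
    if not result.success:
        raise RuntimeError(result.message)
    return result.x


# Hypothetical toy network: uptake -> A, A -> B, B -> export.
S_toy = np.array([
    [1.0, -1.0, 0.0],   # metabolite A
    [0.0, 1.0, -1.0],   # metabolite B
])
bounds_toy = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0)]
v_opt = flux_balance(S_toy, bounds_toy, objective=[0.0, 0.0, 1.0])
```

Here the uptake bound of 10 limits every downstream flux, so the steady-state constraint forces all three fluxes to the same value.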